ABSTRACT:


Axioms  for  a  total  complexity  measure  for  abstract  programs 
are  presented.  Essentially,  they  require  that  total  complexity 
be  an  unbounded  increasing  function  of  the  Blum  time  and  size 
measures.  Algorithms  for  finding  the  best  program  on  a  finite 
domain  are  presented,  and  their  limiting  behaviour  for 
infinite  domains  described.  For  total  complexity,  there  are 
important  senses  in  which  a  machine  can  find  the  best  program 
for  a  large  class  of  functions. 


This  research  was  supported  in  part  by  the  National  Science  Foundation 
and  the  Advanced  Research  Projects  Agency. 

The  views  and  conclusions  contained  in  this  document  are  those  of  the 
author  and  should  not  be  interpreted  as  necessarily  representing  the  official 
policies,  either  expressed  or  implied,  of  the  Advanced  Research  Projects 
Agency  or  the  National  Science  Foundation. 


Reproduced in the USA. Available from the National Technical
Information Service, Springfield, Virginia 22151.


We are primarily concerned, in this paper, with the question of
when a machine can learn a program from samples of its input-output
pairs. This problem of program inference is closely related to the
problem of grammatical inference, which has received a fair amount of
consideration [2]. There are, in the grammatical inference literature,
many results and discussions which can be carried over to program
inference. This paper arose out of an attempt to carry out what we
believed to be a trivial reworking of some of the results of [7]
for programs. In fact, the results on programs turn out to be
significantly different; we will discuss this issue further below.

We are interested in modelling the following situation. A
machine M receives, at each time t, an input-output pair (x,y)
from an unknown program P in a known class $\mathcal{C}$ of programs. At
each time, the machine is to guess some $P' \in \mathcal{C}$ as the best program
for the finite number of input-output pairs seen so far. We show that
there are reasonable conditions under which M can guess the best
program at each finite time and also have good behaviour in the limit.

To  do  this,  we  need  a  formal  notion  of  "best"  program. 

The  key  to  our  development  is  the  combined  complexity  measure 
including  both  program  size  and  running  time.  Many  of  the  difficulties 
arising  in  other  axiomatic  treatments  of  complexity  are  elided  in  the 
combined  complexity  approach. 

More formally, our results will be formulated for programs. A
program can be taken to be any formal computational scheme for
evaluating a recursive function, such as a Turing machine
description. To simplify the discussion it is assumed that the input and
output of a program are both positive integers. The graph $\mathcal{G}(P)$
of a program P is the set of all pairs (x,y) such that P is
defined for x and the output of P given the input x is y.

A sample S of a program P is a finite nonempty subset of $\mathcal{G}(P)$.

The class $\mathcal{C}$ denotes a class of programs which can be effectively
enumerated by an admissible [17] enumeration, such as the class of all
Turing machines, the class of FORTRAN programs, or the class of loop
programs [19]. An inference machine M is any formal effective
procedure for inferring programs from finite samples, that is, M
is defined on the set of samples S of programs in $\mathcal{C}$ and M(S)
is a program in $\mathcal{C}$. We will always require that S is a sample of
M(S), that is

(1)  $\mathcal{G}(M(S)) \supseteq S$

Various  complexity  measures  have  been  discussed,  in  particular 
program  running  time  and  program  size  (see  [12]  for  a  discussion  of 
recent  results).  We  wish  to  discuss  measures  of  program  complexity 
which  take  into  account  both  the  size  and  running  time  of  programs. 

The  simplest  such  measure  is  the  product  of  size  and  running  time. 

Other  measures  are  also  useful.  In  order  to  obtain  general  results 
we  shall  describe  a  complexity  measure  as  any  function  satisfying  a 
simple  set  of  axioms.  The  axioms  for  size  and  running  time  are  the 
same as those discussed in [12], while the axioms for a combined
complexity measure are equivalent to those in [7].

First we assume that the program size or length $L = L_{\mathcal{C}}$
satisfies the conditions

(2)  There is an effective admissible enumeration $\{P_n\}$ such that

(a)  $r(n) = L(P_n)$ is a recursive positive integer valued
total function

(b)  For each n, the set $K_n = \{m \mid r(m) = n\}$ is finite

(c)  The function $\kappa(n)$ = cardinality of $K_n$ is a recursive
function.

The running time T(x,y,P) is a positive effectively computable
rational function and is defined if and only if (x,y) is in the
graph of P. There is a related recursive function

$d(x,y,P,m) = \begin{cases} 0 & \text{if } T(x,y,P) \le m \\ 1 & \text{otherwise} \end{cases}$

We also assume that the combined running time T(S,P) is of the form

(3)  $T(S,P) = \varphi\bigl(\bigcup_{(x,y)\in S} \{T(x,y,P)\}\bigr)$

where $\varphi$ is a recursive function. The related function

$D(S,P,m) = \begin{cases} 0 & \text{if } T(S,P) \le m \\ 1 & \text{otherwise} \end{cases}$

is then recursive.

Let c be a positive recursive rational valued function of two
non-negative rational variables which is increasing and unbounded in each
variable. The complexity measure C is then given by

$C(S,P) = c(L(P), T(S,P)), \quad S \subseteq \mathcal{G}(P)$


Examples  The size L(P) might be the number of symbols used to write
the program in some alphabet or the number of symbols on the tape of
a universal Turing machine needed to describe a simulation of the
program. Some plausible L(P) are excluded because of the
requirement that there be only a finite number of programs of each size.
For example, the number of statements in a FORTRAN program or the
nesting depth of loop programs would not, as normally defined,
satisfy (2)(b). Size measures which take structure into account are
discussed in [2, 6] for grammars.

For a given pair (x,y) the running time T(x,y,P) could be
the time the program P uses to derive output y from input x
(possibly also including the time for reading x and printing y).
Other possibilities are the number of moves or number of tape cells
scanned by a Turing machine, or the number of instructions executed by
the program. One can also normalize by some function of x and y;
for example, T(x,y,P) could be actual running time divided by xy.


The general function T(S,P) can be obtained from T(x,y,P)
in many ways; for example, we could take T(S,P) as

$\max_{(x,y)\in S} T(x,y,P)$   or   $\sum_{(x,y)\in S} T(x,y,P)$

or as an average of T(x,y,P), $(x,y) \in S$.
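
The three combinations just listed can be compared on a small sample. The following is a minimal sketch only: the per-pair running times are made-up numbers, and the helper names `t_max`, `t_sum` and `t_avg` are illustrative and do not come from the paper.

```python
from fractions import Fraction

# Hypothetical per-pair running times T(x,y,P) for a three-element sample S.
times = {(1, 2): Fraction(3), (2, 4): Fraction(5), (3, 6): Fraction(10)}

def t_max(ts):
    """Combined time as the maximum of the individual running times."""
    return max(ts.values())

def t_sum(ts):
    """Combined time as the sum of the individual running times."""
    return sum(ts.values(), Fraction(0))

def t_avg(ts):
    """Combined time as the average of the individual running times."""
    return t_sum(ts) / len(ts)

print(t_max(times), t_sum(times), t_avg(times))  # 10 18 6
```

Rational arithmetic is used because the running time is assumed to be rational valued; the sum grows with every pair added to S, whereas the maximum and the average need not.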

The  possibilities  for  the  function  c(L,T)  are  very  large,  for 
example  each  of  the  following  satisfy  the  hypotheses  for  c  : 

(L+1)(T+1)  ,  L+T  ,  (L+1)(T+1) 


Notice that the simple product LT doesn't satisfy the hypotheses, for
it is not unbounded in L when T=0. We impose this requirement so
as to simplify some later arguments. The very general nature of the
function c precludes the possibility that all complexity measures are
recursively related, a result which is true both for the length
L(P) and time T(x,y,P) (see [12]).
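
The failure of the plain product can be seen numerically. This is a small illustrative check, not part of the original development; the function names are hypothetical. Each candidate for c is evaluated at T = 0 for growing L.

```python
def product(L, T):
    """Plain product LT: fails the hypotheses, since it is 0 whenever T = 0."""
    return L * T

def shifted_product(L, T):
    """(L+1)(T+1): increasing and unbounded in each variable."""
    return (L + 1) * (T + 1)

def total(L, T):
    """L+T: increasing and unbounded in each variable."""
    return L + T

sizes = [1, 10, 100, 1000]
for c in (product, shifted_product, total):
    # Hold T at 0 and let L grow: a valid c must grow without bound.
    print(c.__name__, [c(L, 0) for L in sizes])
```

Only `product` stays constant along T = 0, which is exactly the unboundedness requirement that the later proofs rely on when they compare against c(L(P), 0).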

Remark  1 

Although the results below are quite general, some care must be
used in applying them to actual inference situations. A major
consideration is to choose measures which do not degenerate into strictly time
or strictly size in the limit. For example, $\sum_{(x,y)\in S} T(x,y,P)$
may be unbounded as S gets large, or the average of (time/length)
may go to zero with large S. Depending on the choice of
c(L(P), T(S,P)) either situation could lead to degeneracy. One
must also choose complexity functions which reflect the intuitive
meaning of the problem.

Our later proofs make use of the fact that the programs can be
ordered in terms of increasing size. An Occam's enumeration of $\mathcal{C}$
relative to L is an admissible enumeration $\{P_i\}$ satisfying

(4)  $L(P_i) \le L(P_j)$ if $i < j$.
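
Condition (4) can be met mechanically from any enumeration satisfying (2): for each size n in turn, scan the given enumeration until every program of size n has been found, using the cardinality function of (2)(c) to know when to stop. The sketch below is a toy illustration under stated assumptions: programs are represented only by their indices, and the size function `r` and count `kappa` are invented for the example (every size has exactly two programs).

```python
from itertools import count, islice

def r(m):
    """Hypothetical size of the m-th program; not monotone in m."""
    return 1 + (m % 3) + 3 * (m // 6)

def kappa(n):
    """Hypothetical number of programs of size n (two for every size here)."""
    return 2

def occam_enumeration():
    """Yield program indices in order of nondecreasing size."""
    for n in count(1):
        found = 0
        for m in count(0):
            if found == kappa(n):
                break  # all programs of size n have been listed
            if r(m) == n:
                found += 1
                yield m

order = list(islice(occam_enumeration(), 8))
print(order)                  # [0, 3, 1, 4, 2, 5, 6, 9]
print([r(m) for m in order])  # [1, 1, 2, 2, 3, 3, 4, 4]
```

The inner scan terminates for every n precisely because each size class is finite and its cardinality is computable.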

It is obvious from (2)(b), (c) that a machine can find an Occam's
enumeration relative to L. One consequence of this is the following
simple result:

Lemma 1  Given a complexity measure C = c(L,T) on the infinite class
$\mathcal{C}$ and an Occam's enumeration of $\mathcal{C}$ relative to L, then for any
sample S of some $P \in \mathcal{C}$, there is an index k such that if
j > k then either

(5)(a)  $C(S,P_j) > C(S,P)$

or

(b)  S is not a sample of $P_j$.

Proof  This is a consequence of the assumption that c is increasing
and unbounded in each variable. We merely choose k as the first
index for which

$c(L(P_k), 0) > C(S,P)$

If j > k and S is a sample of $P_j$ then (4) guarantees that

$L(P_j) \ge L(P_k)$

and hence

$C(S,P_j) = c(L(P_j), T(S,P_j)) \ge c(L(P_j), 0) \ge c(L(P_k), 0) > C(S,P)$

This proves the lemma.
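
As a worked instance of this choice of k (an illustration under assumed values, not taken from the paper), take $c(L,T) = (L+1)(T+1)$ and suppose $C(S,P) = 12$:

```latex
\[
c(L(P_k),0) = L(P_k) + 1 > 12 \iff L(P_k) \ge 12 ,
\]
\[
j > k \text{ and } S \subseteq \mathcal{G}(P_j):\quad
C(S,P_j) = (L(P_j)+1)(T(S,P_j)+1) \ge L(P_j) + 1 \ge 13 > C(S,P).
\]
```

So k is simply the first index in the Occam's enumeration at which the size reaches 12, and no later program can do as well as P on S.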

Now  we  prove  the  following  general  theorem. 

Theorem 1  Given a complexity measure C(S,P) on a class $\mathcal{C}$ there
is an inference machine M which infers programs of minimum
complexity, that is, if S is a sample of some program in $\mathcal{C}$,
then S is a sample of M(S) and for all $P \in \mathcal{C}$ for which S is a
sample of P

(6)  $C(S,M(S)) \le C(S,P)$
Proof  The intuitive idea for the proof is as follows: Run $P_1, P_2, \ldots, P_t$
on S for time t, successively incrementing t until some $P_i$, $i \le t$,
runs successfully in time t. Then one need look at no programs
whose total complexity exceeds $C(S,P_i)$, hence one need examine only
a finite set of programs (cf. Lemma 1) and pick the best one.

To formally construct M we first assume an Occam's enumeration
for $\mathcal{C}$ relative to length L. Then

Step 1  Calculate $D(S,P_i,t)$, $1 \le i \le t$. If $D(S,P_i,t) = 1$ for
$1 \le i \le t$, increment t by 1 and repeat Step 1. Otherwise let
$t_0$ be the first t for which $D(S,P_i,t) = 0$ for some $1 \le i \le t$
and let $i_0$ be the first i, $1 \le i \le t_0$, for which $D(S,P_i,t_0) = 0$
and proceed to Step 2.

Step 2  Use Lemma 1 to calculate k so that if j > k and S is a
sample of $P_j$ then

$C(S,P_j) > C(S,P_{i_0})$

Step 3  Compute the first integer $m > t_0$ such that

$C(S,P_{i_0}) < c(0,m)$

Step 4  Let G(S) denote the set of those j, $1 \le j \le k$, for which
$D(S,P_j,m) = 0$

Step 5  Compute $C(S,P_j)$, $j \in G(S)$

Step 6  Let $i_1$ be the first $i \in G(S)$ such that
$C(S,P_i) = \min \{ C(S,P_j) \mid j \in G(S) \}$
and put $M(S) = P_{i_1}$.

Let us show that M(S) has the desired properties. M need
choose no program with complexity greater than $C(S,P_{i_0})$. Step 2
rules out programs which are too long while Step 3 rules out programs
which take too long to run on S, hence if $j \notin G(S)$ then either
S is not a sample of $P_j$ or $C(S,P_j) > C(S,P_{i_0})$, so (6) holds for
M(S). This proves the Theorem.
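
The six steps can be traced on a toy class. The following is a minimal sketch under strong simplifying assumptions, not the construction itself: the class is a finite list already sorted by size (standing in for an Occam's enumeration), each hypothetical `Program` reports its own running time instead of being simulated step by step, the combined time is the maximum, and c(L,T) = (L+1)(T+1). All names are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Pair = Tuple[int, int]

@dataclass(frozen=True)
class Program:
    name: str
    size: int  # L(P)
    # run(x) returns (output, running time), or None where P is undefined.
    run: Callable[[int], Optional[Tuple[int, int]]]

def T(sample: Sequence[Pair], p: Program) -> Optional[int]:
    """Combined running time (max over S); None if S is not a sample of p."""
    times = []
    for x, y in sample:
        result = p.run(x)
        if result is None or result[0] != y:
            return None
        times.append(result[1])
    return max(times)

def D(sample: Sequence[Pair], p: Program, m: int) -> int:
    """0 if p reproduces the sample within combined time m, else 1."""
    t = T(sample, p)
    return 0 if t is not None and t <= m else 1

def c(size: int, time: int) -> int:
    return (size + 1) * (time + 1)

def C(sample: Sequence[Pair], p: Program) -> int:
    return c(p.size, T(sample, p))

def infer(sample: Sequence[Pair], programs: List[Program]) -> Program:
    """Return a program of least complexity among those fitting the sample."""
    if all(T(sample, p) is None for p in programs):
        raise ValueError("no program in the class fits the sample")
    # Step 1: raise t until one of the first t programs succeeds within time t.
    t = 1
    while True:
        limit = min(t, len(programs))
        hits = [i for i in range(limit) if D(sample, programs[i], t) == 0]
        if hits:
            i0, t0 = hits[0], t
            break
        t += 1
    bound = C(sample, programs[i0])
    # Step 2: first index whose size alone already costs more than the bound.
    k = next((j for j, p in enumerate(programs) if c(p.size, 0) > bound),
             len(programs))
    # Step 3: first time limit m > t0 that alone costs more than the bound.
    m = t0 + 1
    while c(0, m) <= bound:
        m += 1
    # Steps 4-6: among the survivors, take the first one of least complexity.
    G = [j for j in range(k) if D(sample, programs[j], m) == 0]
    best = min(G, key=lambda j: (C(sample, programs[j]), j))
    return programs[best]

def table(pairs: Sequence[Pair]):
    """A table-lookup program body with unit running time."""
    lookup = dict(pairs)
    return lambda x: (lookup[x], 1) if x in lookup else None

programs = [
    Program("const", 1, lambda x: (2, 1)),
    Program("double", 3, lambda x: (2 * x, x * x)),  # time grows as x squared
    Program("table", 10, table([(1, 2), (2, 4), (3, 6)])),
]
for S in ([(1, 2)], [(1, 2), (2, 4)], [(1, 2), (2, 4), (3, 6)]):
    print(S, "->", infer(S, programs).name)
```

On these three samples the sketch selects `const`, then `double`, then `table`: once the running time of the short program grows fast enough, the long but fast table becomes the program of least total complexity.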

The machine M constructed in the proof of Theorem 1 will in
certain cases have reasonable convergence properties as the sample
size increases. An information sequence $\mathcal{I}(P)$ is a sequence whose
range is $\mathcal{G}(P)$. An initial segment $S_n$ is the sample

$S_n = \{ \mathcal{I}(P)_i \mid 1 \le i \le n \}$

Given an information sequence $\mathcal{I}(P)$, $P \in \mathcal{C}$, the machine M will
eventually be correct on any input for which P is defined, that is

(7)  If $(x,y) \in \mathcal{G}(P)$, then there is an N such that $(x,y) \in \mathcal{G}(M(S_n))$
for $n \ge N$.

This follows easily from the fact that $S \subseteq \mathcal{G}(M(S))$ and that $(x,y) \in S_n$
for large enough n.

It may not be possible to obtain $\mathcal{G}(P) \subseteq \mathcal{G}(M(S_n))$ for n
large. If f is a recursive total function then it may happen that
any program for f has such rapidly growing running time that $M(S_n)$
is merely a table for $S_n$. In other words, if the running times
for programs for f are all unbounded then size becomes irrelevant
in the complexity measure. If the running time is bounded then the
machine of Theorem 1 will eventually pick only programs which agree
with f wherever f is defined.

Theorem 2  Suppose $\mathcal{I}(P)$ is an information sequence for some program
$P \in \mathcal{C}$ and that $C(S_n,P)$ is bounded as $n \to \infty$. Then for the machine
M of Theorem 1, we will have

$\mathcal{G}(M(S_n)) \supseteq \mathcal{G}(P)$ for n large enough.

Proof  Let $i_0$ denote the first index i for which $\mathcal{G}(P_i) \supseteq \mathcal{G}(P)$
and $C(S_n,P_i)$ is bounded as $n \to \infty$. Put $b = \operatorname{lub}_n C(S_n,P_{i_0})$ and
choose K so that

$c(L(P_K),0) > b$

The programs $P_k$ for $k \ge K$ will never be $M(S_n)$, for their complexity
must be larger than that of $P_{i_0}$ on $S_n$. Furthermore if k < K
and $\mathcal{G}(P_k) \not\supseteq \mathcal{G}(P)$, we can choose $n_k$ so that $S_{n_k}$ will not be a
sample of $P_k$. Thus if n is large enough, $M(S_n)$ must be one
of the programs $P_i$ for which i < K and $\mathcal{G}(P_i) \supseteq \mathcal{G}(P)$. This proves
the theorem.


Notice that if P is total, $M(S_n)$ will eventually be only $P_j$
such that $\mathcal{G}(P_j) = \mathcal{G}(P)$. This behaviour is called matching in the
literature on grammatical inference [7].
Corollary  Suppose that for all information sequences $\mathcal{I}(P)$ of a
given $P \in \mathcal{C}$, the limit

$\gamma(P') = \lim_{n\to\infty} C(S_n,P')$

exists for all $P'$ such that $\mathcal{G}(P') \supseteq \mathcal{G}(P)$. Let $\gamma$ be the minimum of
these $\gamma(P')$. Then for n sufficiently large, $\gamma(M(S_n)) = \gamma$.

Proof  As the proof of Theorem 2 indicates, there is a K such that
for all n, $M(S_n)$ is one of the programs $P_i$, $1 \le i \le K$, and for
n large enough

$\mathcal{G}(M(S_n)) \supseteq \mathcal{G}(P)$

Suppose $i \le K$ and $\mathcal{G}(P_i) \supseteq \mathcal{G}(P)$, $\gamma(P_i) = \gamma$. If $j \le K$,
$\mathcal{G}(P_j) \supseteq \mathcal{G}(P)$ and $\gamma(P_j) > \gamma$, then for n large enough

$C(S_n,P_j) > (\gamma(P_j) + \gamma)/2 > C(S_n,P_i)$

so $M(S_n)$ will not be $P_j$. This proves the Corollary.

Theorem 2 can be applied in any case where the program has
bounded running times. A slight modification enables one to use this
result in the case when a bounding function for the running time is
known. To simplify our discussion we shall assume that

(8)  $T(S,P) = \max_{(x,y)\in S} T(x,y,P)$

We could extend our results (Theorem 3) to more general $\varphi$, but the
extensions do not seem to warrant the additional complexity of proof.
We continue to search for the right generalization of Theorem 3. A
recursive total function b of two variables, which is increasing
in both, is called a bounding function. The running time of $P \in \mathcal{C}$
is bounded by b if there is an $\alpha > 0$ such that

$T(x,y,P) \le \alpha \, b(x,y)$, $(x,y) \in \mathcal{G}(P)$

We will call the least such $\alpha$ the bounding constant of P. A
bounding function gives rise to a new running time defined for
$(x,y) \in \mathcal{G}(P)$ by

$T_b(x,y,P) = T(x,y,P)/b(x,y)$

and for $S \subseteq \mathcal{G}(P)$ by

$T_b(S,P) = \max_{(x,y)\in S} T_b(x,y,P)$

This in turn gives rise to the complexity measure defined by

$C_b(S,P) = c(L(P), T_b(S,P))$

Thus if samples S are drawn from a program P which is known
to have its times bounded by a bounding function b, then one can
choose programs M(S) of minimal $C_b$ complexity and know that

$\mathcal{G}(M(S)) \supseteq \mathcal{G}(P)$

if the sample S is large enough.
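
The effect of normalizing by a bounding function can be seen on a toy example. This is a sketch under invented assumptions: the bounding function `b`, the running time `raw_time`, and the choice c(L,T) = (L+1)(T+1) are all hypothetical and chosen only so that the bounding constant is at most 3.

```python
from fractions import Fraction

def b(x, y):
    """Hypothetical bounding function, increasing in both variables."""
    return x * x + y

def raw_time(x, y):
    """Hypothetical running time T(x,y,P); never exceeds 3 * b(x,y)."""
    return 3 * x * x + 2

def T_b(sample):
    """Normalized combined running time: max over S of T(x,y,P) / b(x,y)."""
    return max(Fraction(raw_time(x, y), b(x, y)) for x, y in sample)

def C_b(size, sample):
    """C_b(S,P) = c(L(P), T_b(S,P)) with c(L,T) = (L+1)(T+1)."""
    return (size + 1) * (T_b(sample) + 1)

sample = [(x, 2 * x) for x in range(1, 50)]
print(max(raw_time(x, y) for x, y in sample))  # grows with the sample
print(float(T_b(sample)))                      # stays below 3
print(float(C_b(5, sample)))                   # hence stays bounded as well
```

Because the normalized time stays bounded while the raw time does not, the complexity of this program remains bounded along any information sequence, which is the situation in which Theorem 2 applies.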


Remark 2  There are a number of computations with known
bounding functions in terms of various types of programs or machines
[13]. The complexity measure $C_b$ will be more sensitive if the
bounding function chosen is a tight one. Thus, if we know a
computation has polynomial bounds we should try to find the particular
polynomial rather than just choose some b that grows faster than any
polynomial. A bounding function that is too large may give rise to
degenerate measures, cf. Remark 1.
It may even be possible to infer a good bounding function as
part of the general procedure for program inference. Here we describe
a method for doing this. Suppose $\{b_k\}$ is a sequence of bounding
functiona  satiating  ^  .  1  ,  bk(x,y)  <  ylif)  ^ 


^  fey  >  T  -  u*W, 

bk+jTx^yJ  y*»  bk^"(x,y) 


We will now show how to infer both a bounding function and then good
programs which run on the sample in that bound.

Theorem 3  Let $\mathcal{C}$ be a class with complexity measure C (where
T(S,P) is given by (8)) and let $\{b_k\}$ be a sequence of bounding
functions (satisfying (9)). There is an inference machine M which
will, for any information sequence $\mathcal{I}(P)$, $P \in \mathcal{C}$, infer both a
sequence of positive integers $\{k_n\}$ and programs $M(S_n)$ such that

(a)  $M(S_n)$ is a program in $\mathcal{C}$ of least $C_{b_{k_n}}$ complexity
whose graph contains $S_n$.

If, furthermore, there is some program $P'$ such that $\mathcal{G}(P') \supseteq \mathcal{G}(P)$ and
$P'$ has its running times bounded by some $b_k$, then for n large enough

(b)  $k_n = k$, a constant

(c)  $\mathcal{G}(M(S_n)) \supseteq \mathcal{G}(P)$


Proof  Let $\{P_i\}$ be an Occam's enumeration for $\mathcal{C}$ relative to L.
M will use a sequence $\{\alpha_n\}$, $\alpha_n$ being the current guess as to the
bounding constant. Initially $\alpha_1 = 1$. The machine proceeds as
follows to obtain $k_n$ and $M(S_n)$.
Step 1  For each i, $1 \le i \le n$, and k, $1 \le k \le n$, M computes

(10)  $\delta_n(i,k) = \max_{(x,y)\in S_n} d(x,y,P_i,\alpha_n b_k(x,y))$

If $\delta_n(i,k) = 1$ for $1 \le i,k \le n$ then M sets $k_n = 1$ and goes
to Step 3. Otherwise M goes to Step 2.

Step 2  M selects $k_n'$ as the first index k such that $\delta_n(i,k) = 0$
for some $i \le k$. M then selects $i_n$ as the first index i such that
$\delta_n(i,k_n') = 0$. M then selects $k_n$ as the least integer such that
$\delta_n(i_n,k_n) = 0$ and goes to Step 3.

Step 3  If $\delta_n(i,k) = 0$ for some i and k, $1 \le i \le n$, $1 \le k \le n$,
and if $i_n = i_m$ for some m < n then M sets $\alpha_{n+1} = \alpha_n$.
Otherwise M sets $\alpha_{n+1} = 1 + \alpha_n$.

Step 4  M selects $M(S_n)$ as the best program using the algorithm of
Theorem 1 with the measure $C_{b_{k_n}}$.

Let us now show that (a), (b), (c) hold. Condition (a) follows
directly from the fact that Step 4 uses the algorithm of Theorem 1.

Consider the set $\mathcal{R}$ of pairs (i,k) such that $b_k$ bounds the
running time of $P_i$ and $\mathcal{G}(P_i) \supseteq \mathcal{G}(P)$. We will show that if $\mathcal{R}$
is not empty then (b), (c) hold. Towards this end let $(\hat\imath,\hat k)$ be
the pair in $\mathcal{R}$ which minimizes the maximum of i and k for
$(i,k) \in \mathcal{R}$ (in case of ties we choose the one which comes first in
the lexicographic ordering of pairs).
We let $\hat\alpha$ be the least constant such that

$d(x,y,P_{\hat\imath},\hat\alpha \, b_{\hat k}(x,y)) = 0$, for all $(x,y) \in \mathcal{G}(P)$

and consider two cases.

Case 1  The machine M makes at least $\hat\alpha$ different guesses of $i_n$.

In this case for n large we always have $\delta_n(\hat\imath,\hat k) = 0$. If
$\hat\imath$ is not $i_n$ it is because there is an (i,k) pair found first. In
particular we have $i_n \le k_n \le \max\{\hat\imath,\hat k\}$ so that $\alpha_n$ must be eventually
constant. By our choice of $(\hat\imath,\hat k)$ any pair (i,k) of lower $\max\{i,k\}$
will eventually be rejected because $\mathcal{G}(P_i)$ doesn't contain $\mathcal{G}(P)$
or because $\delta_n(i,k) = 1$ for all k such that $\max\{i,k\} < \max\{\hat\imath,\hat k\}$.

Thus in Case 1, $i_n$ will eventually be $\hat\imath$ and $k_n$ will eventually
be $\hat k$.

Case 2  The machine M makes $\alpha$ different guesses of $i_n$ and
$\alpha < \hat\alpha$.

In this case we consider the class $\mathcal{R}_\alpha$ of pairs (i,k) with the
following properties

i)  $\mathcal{G}(P_i) \supseteq \mathcal{G}(P)$

ii)  $d(x,y,P_i,\alpha \, b_k(x,y)) = 0$, for all $(x,y) \in \mathcal{G}(P)$

If $\mathcal{R}_\alpha$ were empty M would make more than $\alpha$ guesses. It is easy
to see that if n is large enough we will have $i_n = \bar\imath$ and $k_n = \bar k$ where

$\max\{\bar\imath,\bar k\} = \min_{(i,k)\in\mathcal{R}_\alpha} \max\{i,k\}$

where ties are again broken by taking the least pair in the lexicographic
order. This completes the proof of Theorem 3.
One might think that Theorem 3 can be formulated so that M
can actually infer the least integer k such that some program whose
graph contains $\mathcal{G}(P)$ has its running time bounded by $b_k$. We
suspect that this cannot in general be done.

Example  Suppose $f_1$, $f_2$ are recursive functions such that there are
programs $P(f_1)$, $P(f_2)$ which compute each argument in time $b_1$ and $b_2$,
respectively, and that no program does better for infinitely many
arguments. Let

$f(n) = \begin{cases} f_1(k), & n = 2k-1 \\ f_2(k), & n = 2k \end{cases}$

and consider the sequence of programs $P_i$ such that $P_i$
uses $P(f_1)$ to compute f(n) for n odd, is undefined for
n = 2k, k > i, and computes f(2k), $k \le i$, by a table.

Thus the program length $L(P_i)$ will be unbounded, yet the running
time of $P_i$ will be bounded by $b_1$. If an inference scheme considers
only a bounded number of programs one may be able to infer that f
can be computed in $b_k$ time for some $k \ge 2$. If, however, the
scheme considers more and more programs, one eventually encounters
the $P_i$ which would cause the erroneous guess of time bound $b_1$.

We have not been able to convert examples of this sort into a
proof that no machine can always find the lowest k for which there
is a $P_j$ with $\mathcal{G}(P) \subseteq \mathcal{G}(P_j)$ and $T(S_n,P_j) \le b_k(S_n)$. However, we do
know that any machine that attempts to always find the lowest possible
k will have to look at arbitrarily many $P_i$ for some functions.
This can be forced by taking some function of class k and replacing
it on a finite number of arguments with a function of class k+1.


This suggests the following modification to Theorem 3.

We supply the machine M with an auxiliary function $A(\ell,k)$
which maps a size and a bounding index into a size. This tradeoff
function $A(\ell,k)$ determines the size of program to be considered in
searching for an improvement to an answer $P_i$ of size $\ell$ and
complexity index k. Intuitively, $A(\ell,k)$ says that the user of the inference
machine M prefers a program of class k-1 and size $A(\ell,k)$ to a
program of size $\ell$ and class k.

There is a "natural" A function derived from the complexity
function, namely

$A(\ell,k)$ = first size m such that $c(\ell, T_{b_k}(S,P)) > c(m,0)$.

The construction for Theorem 3 can easily be modified to include
$A(\ell,k)$. This still does not guarantee the minimum value for k,
but seems to be a natural model of inference processes.


Remark 3  There has been a considerable amount of work [12] on
complexity classes of functions. To remain consistent with this work,
we would have to restrict the choice of $\varphi\bigl(\bigcup \{T(x,y,P)\}\bigr)$ to ones
which give the same complexity classes as $\varphi = \max_{(x,y)\in S} T_b(x,y,P_j)$.

A good choice would be

$T_b(S,P_j) = \max_{(x,y)\in S} T_b(x,y,P_j) + \frac{1}{|S|} \sum_{(x,y)\in S} T_b(x,y,P_j)$

This (max + average) measure gives the same classes as max, and also
distinguishes among programs with the same maximum time. Since the ratio
of this measure with the max measure is bounded away from 0 and $\infty$,
Theorem 3 holds for it also. Furthermore, it is also bounded away from
zero, avoiding the degeneracy problem for the usual choices of c(L,T).
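
The bound on the ratio can be made explicit. The short derivation below is an added illustration; it uses only the fact that the running times are positive, so the average over S is positive and cannot exceed the maximum.

```latex
\[
0 < \frac{1}{|S|}\sum_{(x,y)\in S} T_b(x,y,P_j) \;\le\; \max_{(x,y)\in S} T_b(x,y,P_j)
\]
\[
\Longrightarrow\quad
1 \;<\; \frac{\max_{(x,y)\in S} T_b(x,y,P_j) + \frac{1}{|S|}\sum_{(x,y)\in S} T_b(x,y,P_j)}
             {\max_{(x,y)\in S} T_b(x,y,P_j)} \;\le\; 2 .
\]
```

The ratio therefore lies between 1 and 2 for every sample, whatever the program.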

The results derived here for programs have a significantly
different flavor from those developed [7] for grammars. A central
issue in grammatical inference is the presence or absence of negative
information, i.e., strings in a sample marked as not belonging to the
language being learned. This problem does not arise in program inference
for two reasons. With grammars, an answer which generates too many
strings is normally considered wrong, but our constructions allowing
answers whose graph includes that of the hidden function seem quite
natural. This arises from the single-valuedness of functions:
if (x,y) appears in a sample then no (x,y') with $y' \ne y$ can
appear. When $\mathcal{G}(M(S)) \supseteq \mathcal{G}(P)$, M has simply chosen a program which
may be defined for some arguments where P is not. If one attempted
to extend our results to relations, the problems associated with negative
information would reappear.

The results of this paper should be viewed in the context of a
renewed interest in inductive and scientific (hypothetico-deductive)
inference. In addition to the theoretical work on programs and grammars,
there is work on predicate calculus [16] and real chemistry [5].

All of these efforts have applied as well as theoretical components.

Some of our work on program inference is discussed in [8] and [1]
and there is a fairly ambitious effort underway to infer loop programs
from sample traces. Thus far, there has been surprisingly little
carryover from one domain to the other and from theoretical results
to programs, but a common understanding of the issues seems to be
emerging. There are also proposed applications of inference techniques
to pattern recognition [9] and natural language description [14]
which provide constant reminders of the weakness of existing results.